Chunyuan Li

My research centers on multimodal intelligence, with a focus on large-scale language and vision training. Key contributions include LLaVA and its model family, as well as foundational early work such as GroundingDINO, GLIP, GLIGEN, Florence, and Oscar.

My experience includes research roles at xAI, ByteDance, and Microsoft Research, Redmond. I earned my PhD in machine learning from Duke University under the guidance of Prof. Lawrence Carin, where my doctoral research explored deep generative models. I have also served the community as an Area Chair for NeurIPS, ICML, ICLR, EMNLP, and TMLR, and as a Guest Editor of IJCV on "the promises and dangers of large vision models".


news

2025 Grok-3: Visual understanding and real-time video in voice mode.
2024 Exploring the boundaries of fully open-source VLMs to establish a mature recipe, documented in a blog series and on GitHub:
  • LLaVA-NeXT, LLaVA-OneVision, LLaVA-Video, LLaVA-Critic
Developing a proprietary, industry-leading VLM for image and video understanding: Seed-VL-1.5
Oct/Nov, 2023 LLaVA is upgraded:
  • LLaVA-1.5 achieves SoTA on 11 benchmarks among open-source VLMs. It uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses prior SoTA methods that use billion-scale data. [Project] [Paper] [Github] [Demo] [Model Zoo]
  • LLaVA-Interactive: Experience the future of human-AI multimodal interaction with an all-in-one demo for image chat, segmentation, generation and editing. [Project] [Paper] [Github] [Demo]
  • LLaVA-Plus expands the capabilities of LLaVA by learning to use external tools for creating multimodal agents. [Project] [Paper] [Github] [Demo]
September 20, 2023 A 110-page paper is released to share our perspective on multimodal models: "Multimodal Foundation Models: From Specialists to General-Purpose Assistants". This is based on our CVPR 2023 Tutorial. [Note on Large Multimodal Models] [Slides] [YouTube] [Bilibili]
June 1, 2023 LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. NeurIPS 2023 Datasets and Benchmarks Track (Spotlight)
April 17, 2023 Visual Instruction Tuning with GPT-4! We release LLaVA, a Large Language-and-Vision Assistant built towards multimodal GPT-4-level capabilities. NeurIPS 2023 (Oral Presentation) [Project] [Paper] [Github] [Demo] [Data] [Model] [Scaling Note]
April 7, 2023 Instruction Tuning with GPT-4! A "first attempt" to use GPT-4 data for LLM self-instruct tuning. [Paper] [Github] [My Learnings]
March, 2023 CVPR 2023:
  • REACT improves foundation models on various vision tasks by customizing them with retrieval-augmented multimodal knowledge. [Code] (Highlight, top 2.5%)
  • GLIGEN enables a new capability for frozen text-to-image generation models: open-set grounding. [Demo] [Code] [YouTube]
  • X-Decoder: a generalist decoder for pixel, image and language [Demo] [Code]
Feb, 2023 The 2nd Workshop and Challenge on Computer Vision in the Wild (CVinW) at CVPR 2023. For those who are new to this topic, please check out the CVinW Reading List. [Workshop] [SGinW Challenge] [RF100 Challenge]
Oct 23, 2022 The 1st Workshop and Challenge on Computer Vision in the Wild (CVinW) at ECCV 2022. Please check out the videos of this event at [YouTube] [BiliBili]. [Workshop] [ICinW Challenge] [ODinW Challenge]
Oct 17, 2022 "Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends", a 100-page survey paper published in Foundations and Trends® in Computer Graphics and Vision.
Sep 16, 2022 NeurIPS 2022: K-LITE (Oral, top 1%), ELEVATER, and FocalNet. A team effort to push CVinW. [CVPR Tutorial]
  • K-LITE demonstrates the effectiveness of external knowledge in improving language-image models for zero-/few-shot task transfer
  • ELEVATER is a platform with 20 image classification and 35 object detection public datasets for evaluating language-image models in task-level visual transfer. [Benchmark Website]
  • FocalNet [paper, code, demo, blog] achieves SoTA on COCO object detection with a simple, attention-free architecture
Mar 25, 2022 Upcoming events as a co-organizer:
Mar 1, 2022 CVPR 2022:
June 17, 2021 EsViT achieves SoTA 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with an order of magnitude higher throughput. [GitHub]